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An 0(log N) parallel algorithm for computing the 
eigenvalues of a symmetric tridiagonal matrix 


by 

Paul N. Swarztrauber 1,1 
December 1989 


ABSTRACT 

An O(log N) parallel algorithm is presented for computing the eigenvalues of a 
symmetric tridiagonal matrix using a parallel algorithm for computing the zeros 
of the characteristic polynomial. The method is based on a quadratic recurrence 
in which the characteristic polynomial is constructed on a binary tree from poly- 
nomials whose degree doubles at each level. Intervals that contain exactly one 
zero are determined by the zeros of polynomials at the previous level which 
ensures that different processors compute different zeros. The exact behavior of 
the polynomials at the interval endpoints is used to eliminate the usual problems 
induced by finite precision arithmetic. 
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1. Introduction. A parallel algorithm is presented for computing the eigen- 
values of a symmetric tridiagonal matrix. 


A c i 


C 1 ^2 C 2 


A = 


1 


c N-l b N 


( 1 . 1 ) 


If c. = 0 the problem can be reduced to two independent eigenproblems and 
hence it is customary to assume that c f . + 0. However, we do not make use of 
this fact and the algorithm is valid for near multiple or multiple zeros. We begin 
with a brief review of existing methods. In 1981, Cuppen developed a method 
based on the splitting 
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where Tj and T 2 are symmetric tridiagonal matrices and C has one nonzero ele- 
ment c Nn in its lower left hand corner. The eigenvalues of and T 2 can be 
computed in parallel followed by an update procedure in which the eigenvalues 
of A are computed from the eigenvalues of Tj and T 2 . This approach can be 
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applied recursively by splitting Tj and T 2 and so forth until matrices of order 
one are obtained. The parallelism is now evident in the recursive process of com- 
puting the eigenvalues, of successively larger matrices from those of smaller 
matrices. This approach has been analyzed extensively by Dongarra and Soren- 
son, 1987, who solved many of the practical problems that arise in its implemen- 
tation and demonstrated that zero finding provides a viable technique for com- 
puting eigenvalues. In particular they used deflation to improve reliability and 
performance (Bunch, Nielson, and Sorenson, 1978). 

Eigenvalues have also been computed as the zeros of the characteristic polyno- 
mial of A. A variant of bisection, called multisectioning was recently developed 
and analyzed (Lo, Philippe, and Sameh, 1987). Sturm sequences are developed 
on multiple subintervals to isolate the eigenvalues. The eigenvalues are then 
determined by bisection or Newton’s method or the zero-in method that com- 
bines bisection and the secant method. Multisectioning is particularly appropri- 
ate if only a few eigenvalues and/or eigenvectors are of interest. Ipsen and 
Jessup, 1989, have made a detailed comparison of Cuppens method, bisection and 
multisectioning on the hypercube. 

In this paper, the eigenvalues are also computed as the zeros of the characteristic 
polynomial. The polynomial is evaluated on a binary tree structure using a qua- 
dratic recurrence in which the degree of the polynomials doubles at each step 
(Swarztrauber, 1979). Using N processors, the characteristic polynomial can be 
evaluated in O(logiV) time. Krishnakumar and Morf, 1986, also use this qua- 
dratic recurrence to compute the eigenvalues of a symmetric tridiagonal matrix 
in O(NlogN) time. Their method of separating the zeros is different from the one 
presented here. 

A fundamental problem in parallel zero finding is to ensure that different proces- 
sors compute different zeros. We show that the zeros of the polynomials at any 
step in the quadratic recurrence are separated by the zeros of the polynomials at 
the previous step. Hence the zeros can be determined by recurrsion beginning 
with the single zeros of linear polynomials and ending with the zeros of the 
characteristic polynomial. The precise nature of the separation is given in section 
3. 

There are two other problems associated with the implementation of zero finding 
methods for characteristic polynomials, both of which are associated with the use 
of finite precision arithmetic. First, even if c i # 0, the zeros may not be 
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separated because of rounding errors. A procedure is presented in section 5 that 
permits the automatic detection and handling of these cases without a machine 
dependent test. Second, for large N and/or I IAI I, the possibility of 
over/underflow exists when computing the characteristic polynomial. Usually 
this can be handled by an apriori scaling of A but a more reliable approach is 
required if the algorithm is used in general purpose software. 

An acceptable alternative is to use dynamic parallel scaling in which the inter- 
mediate computations are maintained at 0(1), (Swarztrauber, 1979). Another 
alternative is to reformulate the problem in terms of self-scaling rational func- 
tions. Both alternatives are presented in section 5 and their accuracy is com- 
pared with the QR algorithm in section 6. 


2. A Parallel Algorithm For Evaluating the Characteristic Polynomial. 
The characteristic polynomial will be computed in terms of characteristic polyno- 
mials d i . of submatrices consisting of rows (and columns) t through j of A. 
Expanding about the j th row we obtain the well known three term recurrence 
relation 
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we obtain the two term matrix recurrence 
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To solve this recurrence relation we define 


(2.4) 


%i m 11 
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Next we show that (2.4) has the closed form 


Q = 


d. . 


c d- ■ , 

; «,;-i 


C i - 1 + 1,/ C i - 1 e j + l,y - 1 


(2.5) 


with elements that can be determined from the characteristic polynomials of four 
submatrices. The desired characteristic polynomial d l N is given as the upper left 
element of Q x N . The proof of (2.5) is by induction on j. Equation (2.5) can be 
verified by direct computation for j = t + 1. If we define d i+l i = = 1 and 

d,- +1 ,•_! = 0 then equation (2.5) is also true for j= t . Now assume that it is true 
for j— 1, then 




( 2 . 6 ) 


After matrix multiplication, (2.1) can be used with (2.6) to verify (2.5) which 
completes the proof. 

The associative property of matrix multiplication provides a splitting formula 
that is fundamental to the parallel algorithm. For any k 


- Q 1,*Q* + I,r 


(2.7) 
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Consider now the parallel algorithm for computing Q, l)N for the case N — 8. 
Step 1 Compute 


Q 


2« — 1,2» 




' 2 f — 1 


c 2t-2 ® 


b 2i ~\ 


'2i 


C 2i-I ® 


( 2 . 8 ) 


for i = 1,2,3, and 4. 
Step 2 Compute 


Ql,4 = Ql, 2^3,4 and 


^5,8 Q4,sQ«,7. 


(2.9) 


Step 3 Compute 


Q 


1,8 


Ql, 4^5,8. 


( 2 . 10 ) 


The computations within each step can be performed simultaneously. For gen- 
eral N, the first step requires 5 multiplications and 2 additions per matrix multi- 
plication for a total of 3.5iV flops. The rth step requires 12iW2 r flops for 
r = 2,...,log 2 iV or about 6N flops. Therefore on a single processor, the computa- 
tion of the characteristic polynomial requires 6.5 N flops. With N processors the 
first step requires 4 flops (since 2 processors are available for each matrix multi- 
ply), 3 flops are required for the second step and 2 flops are required for each 
step thereafter. A minimum of two flops are required since the additions must 
follow the multiplications. Hence a total of 2log 2 iV+3 flops are required to 
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compute the characteristic polynomial using N processors. In the sections that 
follow we will use the element-wise form of (2.7), which is obtained by substitut- 
ing (2.5) into (2.7). 




d i,k d k+l,j 


C k d i,k-l d k+2,j. 


(2.11a) 


4 -,/- 


«,£ “i+i.y-i 


C k d i,k-l d k + 2,j-l. 


(2.11b) 


+ d i+l,k d k + \,j 


C k d i + l,k-l d k + 2,j. 


(2.11c) 


2 

d i + l,j — 1 = d i + l,k d k + l,j-l — C k d i + l,k-l d k + 2,j-l. (2. lid) 

Equations (2.11b) through (2. lid) are the same as (2.11a) but with a suitable 
shift in the subscripts j and j. Nevertheless all four equations are required to 
"close" or complete the recurrence relations. 

The parallel algorithm developed in this section can be used to solve a general 
tridiagonal system of equations (Swarztrauber, 1979). It has also been used 
(Kxishnakumar and Morf, 1986) as part of a parallel algorithm for computing the 
eigenvalues of a symmetric tridiagonal matrix. Their method of separating the 
zeros is different from the one given in this paper. 


3. The Separation Theorem. The fundamental problem with zero finding on 
a multiprocessor is ensuring that different processors compute different zeros. In 
this section we will show that if \ l are the collated zeros of the four characteristic 
polynomials on the right side of equation (2.11a) then the zeros of d. ■ occur one 
per interval in every other interval (k 2 /A 2 /+i)- This result also holds for (2.11b) 
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through (2. lid) but it will only be demonstrated for (2.11a) since the proofs are 
almost identical. The separation theorem is fundamental to the parallel algorithm 
and ensures that each processor will find a different zero. To prove the theorem 
and develop the precise nature of the separation we will need the following three 
lemmas. 


Lemma 1 

Let p(x) = p Q + • • * + p k x k and q(x ) = q 0 + * * • + 1 be polynomials 

with real and strictly interlacing zeros. If p k and q k _ k have the same sign then 
r(z) = p(x)/q(x) is monotone increasing on any interval that does not contain a 
zero of g(z). If they have the opposite sign then r(x) is monotone decreasing. 
More specifically, if p k /q k _ l > 0 then r' (x) > P k f%-y If p k /q k _ l <0 then 

r ' (z) < Pk f %-v 


Proof: The partial fraction expansion of r(x) is 


p k Pk-i^k-rPkh-2 k ~ l w i 

r (x) = z + + 2 • 

«»-l 1-1 (* - 


(3.3) 


where a, are the zeros of q(x ) and w t = p(al)/q' (otj). Therefore 


r '(x) = 


Pk 


k-1 

E ‘ 


w, 


q k - 1 /=o (* - a /) 


(3.4) 


If p k and q k-l 
sign[?' (a,)] = 


are both greater than zero then sign[p(otj)] = sign( — 1)* * + l and 
sign( — 1) ; * and hence w, < 0. This result can be combined with 
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(3.4) 

to complete the proof of Lemma 1 for this case. The remaining three cases are 
handled in a similar fashion. 


Lemma 2 

Let i2(x) ~ r i( I ) r 2 ( I ) — where r 1 ( I ) = p^(z)/fl^(*) and 

r 2 (x) = p^ 2 \x)/q^ 2 \x) are like r(x) in Lemma 1. Also let c # 0 be real and \ l 
be the zeros of p^(x), g^(x), p^(x) and qr^(x) which are assumed to be real, 
distinct, and ordered. Then R{x ) can not have a zero in both adjacent intervals 
and (XpXj + j). 


Proof: Only one of the polynomials p 1 (i), P 2 ( x ) or ? 2 ( z ) changes sign at 

Xj and hence r 1 (x)r 2 (x) must be negative on one of the intervals which therefore 
does not contain a zero of R(x). 


Lemma 3 

Let R[x) be defined as in Lemma 2 where the degrees of r 1 (x) and r 2 (x) are / 
and m respectively. Also let the high order coefficients in the polynomials satisfy 
signfp/ 1 ^/- 1 1 — sign[p^ 2 V^ 2 l. ] . Then R[x) does not have more than one zero 
in any interval (XA, + .). 


Proof: From Lemma 1, r' t (x) and r' 2 (x) have the same sign. If 12(a) is zero 
then r t (x) and r 2 (x) also have the same sign. From the chain rule for 
differentiation r 1 (x)r 2 (x) must be monotone which implies that a is a unique 


zero. 
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The four sets of eigenvalues belong to the four characteristic polynomials listed 
on the left side of equations (2.11a-d). The eigenvalues of A are computed in the 
first of the four steps listed above when r = s and hence the remaining three 
sets do not have to be computed. The amount of computation doubles when r 
increases by one until r = s when it is halved. Therefore the amount of compu- 
tation for the intervals that contain the eigenvalues of A is about four times the 
amount that would be required to compute the eigenvalues of A if the intervals 
were known. 

Let n g be the maximum number of polynomial evaluations that are necessary to 
determine a zero. For each r = l,...,log 2 N a total of about AN zeros of charac- 
teristic polynomials with degree 2 r must be computed. From section 2, 6.5 - 2 r 
flops are required to evaluate a polynomial on a single processor. Therefore a 
total of 

log 2 W 

TF X = 26n e N'Z 2 r < 52 n g N 2 (4.1) 

r = 1 

flops are required to compute the eigenvalues of a symmetric tridiagonal matrix 
on a single processor (using the parallel algorithm for evaluating the polynomials. 
With N processors each zero can be determined on a separate processor for a 
total flop count of 

TF n < 52n e N. (4.2) 

With N processors a characteristic polynomial with degree 2 r can be computed 
with 2r + 3 flops using the algorithm given in section 2. In addition each zero can 
be determined on a separate processor so the total number of flops is 

log 9 N 

TF n > An e £ (2r + 3) < 4n e log 2 N. (4.3) 

r = l 

This is the count for computing the eigenvalues of a symmetric tridiagonal 
matrix with N processors. Note that the four polynomials in (2.11a-d) could be 
evaluated independently with additional processors for a minimum of n e log 2 N 
flops. 


In the development of the operation counts we have, in the traditional way, over- 
looked what could be a significant contribution to the total computing time; 
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namely, the time required for communication. The collation of the zeros of the 
four characteristic polynomials in the separation theorem must also be performed 
in O(log N) time or the overall algorithm cannot be considered to be O(log N). 
The collation of d {k , d k+l d i k _ v and d k+2 . can be done in the following 
steps. 


1. The zeros of d. k and d. k _ l interlace and hence a shuffle can be used to 
combine them in an ordered sequence, say \ . . 

2. Similarly the The zeros of d... ■ and d. , , . can be shuffled to produce an 

K~r 1 1 J K-f- 

ordered sequence, say \> '. 

3. Finally the ordered sequences and and can be merged to form the 
desired collated sequence. 

For a polynomial of degree 2 r each of these steps can be performed with 0(r) 
parallel transmissions and hence the overall time for communication with N pro- 
cessors is proportional to 

log 2 tf 

2 r a log *N (4.4) 

r = 0 

2 

Therefore the overall algorithm including communication is 0(log N ) if the 
architecture of the multiprocessor supports the algorithmic requirements of the 
shuffle and merge. The hypercube and related interconnection topologies support 
both the parallel computation and communication that are implicit in the algo- 
rithms presented here. 


5. Implementation. In this section we will develop both a polynomial and 
rational function implementation of the algorithm. The polynomial implementa- 
tion may require scaling to avoid over/underflow but it is more accurate than the 
rational function implementation. We determine the behavior of the functions at 
the endpoints of the intervals to eliminate the usual problems associated with 
finite precision arithmetic. This also provides both implementations with a high 
level of reliability. The computation of near multiple zeros (or multiple zeros if 
c. = 0) is facilitated by multiple intervals with zero (or near zero length) that 
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provide the correct multiplicity. The purpose of this section is to provide certain 
details of the implementation that are necessary before the algorithm can be con- 
sidered of practical use. 

A. Polynomial Implementation. 

In this implementation the polynomials are computed using the parallel algo- 
rithm that was given in section 2. We do not discuss the zero finding method 
itself other than to note that the method of bisection was used for the results 
presented in section 6. A considerable amount of literature is available on this 
topic and many options exist. 

The reliability of the algorithm is greatly enhanced by knowing the signs of the 
characteristic polynomials at the endpoints of the intervals. Because the eigen- 
value enters the polynomial with a negative sign, i.e. as 6 f . — A. on the diagonal of 
A, the high order term in . . is (-l) ;-, + 1 z J_, + 1 . But /-t' + l=2 r is an even 
integer which combined with the separation theorem, implies that the signs of 
d (j on the intervals \ji:k:j) for /=0,...,2 r + 1 -l are 
+ — — + + •■•+ + — — +. Similarly the signs of d { -_ x on \^i:k:j — 1) for 

l = 0,...,2 1_ 3 are + — — + + ■•■— — + + —. The signs of d,. . on 

, - * *** ) 

X^t + lik'.j'j for /= 0,...,2 —3 are + — — + + ••■— — + + — and the signs 

°f d{+i t y_i Xj(i + l:k:j — 1) for /=0,...,2 r + 1 — 5 are 

+ + + •••+ + +. 

In practice there are two reasons why these sign patterns can be interrupted. 
First, as previously noted, d. k and d^ +2 . are characteristic polynomials of 
different submatrices and may therefore share a common zero that would also be 
a zero of d { .. Then, with finite precision, a small value could be obtained for 
d i . with the wrong sign. Second, although in theory the zeros of d , and d . , 

(or d k+lt j and ,•) interlace, with finite precision they may coalesce or cross 
and again produce a small number with the opposite sign. Both cases occur 
when a zero of the characteristic polynomial has already been found to machine 
precision and it would therefore seem reasonable to select one or the other as a 
zero. However, endpoint zeros might belong to a different interval in the presence 
of near multiple or multiple zeros. The most satisfactory approach is to replace 
any small end point value, whose sign differs from the correct sign, by the 
machine epsilon divided by, say 10, with the correct sign. The bisection method 
or variants thereof can then be used to select the zero. This procedure is 

l 
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fundamental to the reliability of the algorithm and guarantees that N zeros will 
be found. Note that these tests do not require a machine dependent test, i.e. only 
sign tests are required. 

In practice, some of the intervals usually shrink to zero as the computation 
proceeds. This is beneficial because it reduces the amount of computation that is 
required to compute the zeros. This "deflation 11 is similar to that reported by 
Dongarra and Sorenson, 1987, who observed that it could make their method 
competitive with the QR algorithm on a single processor. However there are 
important cases where deflation does not occur, e.g., the tridiagonal matrix with 
bj = —2 and c { = 1. For cases where deflation occurs it is likely that the length 
of the intervals is about machine precision and not identically zero. This could be 
detected using a machine dependent test; however, a more satisfactory approach 
has been to detect interlace faults and set the zeros that have crossed to their 
average value. This creates an environment where the length of many intervals is 
identically zero and detectable without the use of a machine dependent test. 

For large N or MAN the characteristic polynomial may over/underflow the 
arithmetic unit. This can be avoided by scaling the intermediate 2x2 matrices 
that occur during the parallel algorithm, (Swarztrauber, 1979). Scaling is not 
required at each level r so the time required for scaling can be kept to a 
minimum. For example, if scaling is performed for r = 5,10,15,... about iV/32 
matrices are scaled. If "power of two" scaling is used then accuracy is unchanged 
and the time is negligible, particularly if it is done in machine language. Scaling 
can also be eliminated by a reformulation in terms of rational functions. 


B, Rational Function Implementation. 

vv y 

Tfljere'. are three reasons to consider a second implementation of the algorithm. 

. ' V’' 

First, the rational function is self-scaling; second, one can take further advantage 
of deflation to reduce the order of the rational functions; and third, the operation 
count for the rational function implementation is somewhat less than the polyno- 
mial implementation. This must be weighed against a loss of accuracy compared 
with the polynomial implementation. The loss is not substantial since only two 
binary bits are lost when compared with the QR algorithm for a matrix with 
order 1024. Accuracy tables are provided in section 6. 

The eigenvalues of A are computed as the zeros of the rational function fZ(x) 
that was introduced in Lemma 2, section 3, namely 



fl(*) = r l( I ) r 2 ( I ) - c k 


(5.1) 


where r^x) = d ik /d ik _ l and r 2 (x) = d k + l J /d k+2 ■. Ail polynomials are 
evaluated and stored in factored form. Like the polynomial implementation, it 
is necessary, from a practical point of view, to know the behavior of /2(x) at the 
end points of the interval [ctpOtJ under consideration. If a zero of (5.1) exists 
then r 1 (x) and r 2 (z) must have the same sign. They also satisfy the conditions of 
Lemma 1 which ensure that they are both monotone decreasing functions. Since 
the ratio of the high order coefficients is -1 for both r 1 (x) and r 2 (z). These con- 
ditions are satisfied only if d. k or d t + 1 ■ are zero at one end of the interval and 
d{ k~\ or ^k + 2 j are zero ^he °th er which implies that only two cases are possi- 
ble, namely the ones listed in steps 4. and 5. below. Consider now the steps that 
must be taken to ensure a reliable implementation. 

1. Like the polynomial implementation, any interlace faults are corrected by 
replacing the zeros that have crossed by their average value. 

2. Any zeros that are identically common to both the numerator and denomi- 
nator of r 1 (x)r 2 (x) are removed and selected as zeros of d. .. This deflation 
can substantially reduce the amount of computation. A matrix with order 
A7=4096 is presented in section 6 for which the maximum order of any com- 
puted polynomial is 24! 

3. If a. = a, then a. is selected as a zero of d ... This deflation step can be 
used with both implementations. 

4. If a 2 and either d f . k or d k+1 . are zero at then ^(otj is set to —c k 

and R (a 2 ) is set to a large positive number. 

5. If a 2 and either d ( k _ k or d k + 2 ■ are zero at a x then R (aj is set to a 

large positive number and R (a 2 ) is set to — c k . 

Case 4. applies to the first interval and case 5. applies to the last interval but 
with minor modifications. Machine independent tests can be used to determine 
which of the cases 1.-5. apply. Once the values of R (x) are established using 
step 4. or 5. the zeros can be determined using bisection or any variant thereof. 
Note that this does not preclude the use of another method such as the one used 
by Dongarra and Sorenson, 1987, if it is combined with bisection. 
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Although scaling is not required for this implementation, the rational functions 
r 1 (x) and r 2 (x) should be computed as a product of quotients (x — X,-)/(x — P f .) 
where X t - and 3 ( . are chosen as close as possible. This prohibits the growth of 
intermediate computations that might result in over/underflow. 

6. Computational results. The accuracy of implementations A and B are 
compared with the QR algorithm as implemented in subroutine TQL1 in 
EISPACK (Smith, et.al.,1976) for symmetric tridiagonal matrices. The entries in 
the tables below are computed from 

$ 

max I X . — X ( - I 

« = — 7 — (6.1) 

max I X - I 

t 

where is computed in single precision and X. . is computed using a double pre- 
cision version of subroutine TQL1. All the computations were done on the 
CRAY-2 computer located at NASA Ames Research Center. 

Three tables are presented that correspond to three different matrices. Table 1 
contains a comparison of accuracy for a matrix with random coefficients. The 
matrix corresponding to Table 2 is W+ k which is a slight variant of (Wil- 

kinson, 1965, p.308). The variant is used because the parallel algorithms were 
implemented for matrices with order equal to a power of 2. Although this restric- 
tion can be removed it nevertheless simplifies implementation. is of interest 
because it tests the ability of a method to handle near multiple eigenvalues. 
Table three corresponds to the matrix with zeros on the diagonal and ones adja- 
cent to the diagonal. The following observations can be made from the tables. 

1. The accuracy of the polynomial implementation is always superior to the 
accuracy of the QR algorithm. This result can be established by comparing 
the second and third columns in Tables 1,2 and 3. 

2. The accuracy of rational function implementation is also superior to the QR 

algorithm for matrices with random elements or the matrices These 

results can be established by comparing columns two and four in Tables 1 
and 2. 

3. The accuracy of the QR algorithm is superior to the accuracy of the rational 
function implementation for the matrix with zeros on the diagonal and Is in 
the sub and super diagonals. This result can be established by comparing 
columns two and four in Table 3. The difference is not substantial, for 


AO - 


N = 1024 they differ in only the last two bits 



Table 1 


Accuracy of the QR method and two implementations 
of the parallel algorithm for a symmetric 
tridiagonal matrix with random coefficients 


N 

QR 

PI 

RI 

128 

5.67X10" 13 

2.06X10" 14 

3.95X10 -13 

256 

1.25X10" 12 

3.95X10" 14 

2.57X10" 13 

512 

1.82X10 -12 

3.94X10 -14 

7.79X10" 13 

1024 

3.42X10 -12 

4.60X10" 14 

7.79X10 -13 

2048 

7.33X10 -12 

6.94X10 -13 

1.61X10" 12 

4096 

1.32X10" 11 

7.00X10" 13 

2.14X10" 12 


QR as implemented in subroutine TQL1 from EISPACK 
PI is the polynomial implementation 
RI is the rational function implementation 













Table 2 


Accuracy of the QR method and two implementations 
of the parallel algorithm for a symmetric 
tridiagonal matrix with near multiple eigenvalues 


N 

QR 

PI 

RI 

128 

2.63X10" 13 

2.81X10 -14 

1.05X10 -14 

256 

6.36x 10~ 13 

3.53X10 -14 

1.06x10" 14 

512 

1.28X10" 12 

4.25X10 -14 

1.42X10" 14 

1024 

2.55X10 -12 

4.97X10" 14 

1.06X10" 14 

2048 

5.31X10" 12 

5.68X10" 14 

1.07 x 10" 14 

4096 

9.88X10' 12 

6.39X10" 14 

1.42X10" 14 


QR as implemented in subroutine TQL1 from EISPACK 
PI is the polynomial implementation 
RI is the rational function implementation 
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Table 3 


Accuracy of the QR method and two implementations 
of the parallel algorithm for a symmetric 
tridiagonal matrix with zero diagonal and l’s off diagonal 

N 

QR 

PI 

RI 

128 

4.37X10“ 13 

5.33X10 -14 

1.53x 10“ 13 

256 

7.35x 10“ 13 

6.75X10 -14 

6.64X 10“ 13 

512 

1.44X 10“ 12 

1.46X10 -13 

2.38X10 -12 

1024 

2.73X 10“ 12 

2.06X10 -13 

1.04X10 -11 

2048 

5.42X 10“ 12 

3.45X 10“ 13 

2.78X10 -11 

4096 

1.01X10 -11 

6.25X10" 13 

2.24X10 -10 

QR as implemented in subroutine TQL1 from EISPACK 
PI is the polynomial implementation 
RI is the rational function implementation 


The accuracy of the rational function implementation in Table 3 is somewhat less 
than in Tables 1 and 2, probably because deflation does not occur in Table 3. 
Deflation occurred for all the parallel computations in Tables 1 and 2 and 
appears to be the rule rather than the exception. For N = 4096 the eigenvalues 
of the matrix W ^ are given as the zeros of a polynomial of degree 4096; however 
because of deflation, the maximum degree of any computed rational polynomial 
is 24! Similarly, the maximum degree of any computed rational polynomial for 
Table 1 is 75. The eigenvalues of the matrix in Table 3 are \ . = 2cosjtt/(AT+ 1) 
which are separated to the extent that deflation is minimal. Since the 












submatrices are identical some deflation does occur but the maximum degree is 
4096 compared to 24 for the matrix W^ k presented in Table 2. This difference in 
computation may explain the difference between the fourth column of Table 3 
and the fourth column in Tables 1 and 2. 

The results in Table 2 show that near multiple zeros do not present any 
difficulties for the parallel method. Indeed the amount of computation may be 
reduced. Recall that the method produces exactly N intervals that contain one 
and only one zero and therefore, in the presence of near multiple zeros, the inter- 
val length can be near or identically zero. Therefore a zero can be obtained 
directly and with the proper multiplicity without using any zero finding methods. 
For intervals with near zero length the usual concern about slow convergence of 
Newton like methods for multiple zeros is not warranted because: first, the zeros 
are distinct to machine precision and second, any point on the interval provides a 
good initial approximation to the zero. 

Since the interval end points are themselves zeros of characteristic polynomials it 
is reasonable to ask if they are subject to the usual concerns about near multiple 
zeros, i.e. has the problem simply been deferred to the interval endpoints? The 
answer is that they too may be found on adjacent short intervals or on a juxta- 
position of intervals with near zero length. This line of reasoning continues recur- 
sively to the first level where linear polynomials provide the interval endpoints 
for quadratic polynomials. Since near multiple zeros are clearly not a problem at 
this level, by induction they are not a problem at any level. 

The performance of the algorithm on a parallel computer is a complex issue that 
depends on many factors including: 

a) Many methods could be used to determine the zeros within each interval 
including Newtons method, bisection, and the secant method together with 
variants and combinations. Zero finding is itself a significant area of compu- 
tational mathematics. Dongarra and Sorenson, 1987, use a variant of New- 
tons method in which a local quadratic rational approximation to the func- 
tion is computed. The zero-in method (Lo, Philippe, and Sameh, 1987) com- 
bines bisection and the secant method. 

b) Deflation plays a significant role in reducing computing time on either a sin- 
gle or multiprocessor. The effects are two-fold: first, the number of com- 
puted zeros is reduced and second, the order of the rational functions is 
decreased. With deflation the computing time on a single processor for the 
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rational function implementation was proportional to N rather than N . 
Dongarra and Sorenson, 1987, report performance superior to QR algorithm 
in the presence of deflation even on a single processor. However, as evi- 
denced by Table 3, deflation does not always occur which verifies that a 
matrix with random coefficients does not provide a stringent test because it 
probably does not correspond to either the worst or the best case. 

c) The algorithm requires global communication and will therefore perform 
better on parallel computers that provide efficient global parallel pathways 
such as those provided by the hypercube and related interconnection topolo- 
gies. 

d) The algorithm requires the implementation of the shuffle and merge com- 
munication tasks. To maintain an overall complexity of <3 (log N ) these 
tasks must be implemented with parallel communication algorithms on suit- 
able multiprocessor architectures. Although communication does not change 
the overall complexity it is likely to make a significant contribution to the 
overall computing time. 


7. Summary and conclusions. A parallel algorithm has been presented for 
computing the eigenvalues of a symmetric tridiagonal matrix in O(log N ) time 
using N processors or 0(N ) time using N processors. Attributes of the method 
that contribute to reliability and performance are: i) a separation theorem that 
ensures that different processors find different eigenvalues, ii) implementations 
that eliminate the usual problems associated with finite precision arithmetic, iii) 
reliable treatment of multiple and near multiple zeros, and iv) high accuracy. 
Two implementations of the algorithm were presented. The polynomial imple- 
mentation is more accurate than the QR and the rational function implementa- 
tion has comparable accuracy with deflation but is less accurate in the absence of 
deflation. Nevertheless it remains of interest because it is self scaling and takes 
full advantage of deflation both from the standpoint of having to compute fewer 
zeros of rational functions with lower degree. 
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